Transitioning from serial CPU programming to GPU programming requires a paradigm shift: from element-wise iteration to block-based execution. We no longer view data as a stream of scalars, but as collections of "blocks" scheduled to saturate hardware bandwidth.
1. Memory-Bound vs. Compute-Bound
A kernel's bottleneck is determined by its arithmetic intensity: the ratio of math operations to bytes moved. Vector add is memory-bound because it performs only one addition for every three memory operations (two loads, one store). The hardware spends more time waiting for DRAM than calculating.
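This ratio can be checked with a back-of-the-envelope roofline estimate. The sketch below uses hypothetical peak figures (300 GB/s DRAM bandwidth, 10 TFLOP/s compute) purely for illustration:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

# Per float32 element of vector add: 1 addition, and
# 2 loads + 1 store of 4-byte floats = 12 bytes of traffic.
ai = arithmetic_intensity(flops=1, bytes_moved=12)  # ~0.083 FLOP/byte

# Hypothetical GPU: 300 GB/s DRAM, 10 TFLOP/s peak math throughput.
peak_bw = 300e9      # bytes/s
peak_flops = 10e12   # FLOP/s
ridge_point = peak_flops / peak_bw  # ~33 FLOP/byte

# Vector add's intensity sits far below the ridge point, so it is
# firmly memory-bound: runtime ~ bytes_moved / peak_bw.
print(ai < ridge_point)
```

Any kernel whose intensity falls below the ridge point is limited by memory bandwidth, not by math throughput.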
2. The Role of BLOCK_SIZE
BLOCK_SIZE defines the granularity of parallelism. If it is too small, we underutilize the GPU's wide execution lanes; if it is too large, per-block resource pressure can limit how many blocks are resident at once. An optimal size keeps enough "work in flight" to saturate the memory bus.
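The mechanics of block-based partitioning can be sketched in plain Python. Each block owns BLOCK_SIZE contiguous elements, the grid size comes from a ceiling division, and the final block masks out-of-bounds indices (the function names here are illustrative, not a real launch API):

```python
def grid(n, block_size):
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block_size - 1) // block_size

def block_indices(pid, block_size, n):
    """Element indices owned by block `pid`, masked to stay in bounds."""
    start = pid * block_size
    return [i for i in range(start, start + block_size) if i < n]

n, BLOCK_SIZE = 10, 4
num_blocks = grid(n, BLOCK_SIZE)  # 3 blocks: [0-3], [4-7], [8-9]
coverage = [i for pid in range(num_blocks)
            for i in block_indices(pid, BLOCK_SIZE, n)]
assert coverage == list(range(n))  # every element covered exactly once
```

The mask on the last block is what lets BLOCK_SIZE be tuned freely without requiring the problem size to be an exact multiple of it.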
3. Latency Hiding through Occupancy
Occupancy is the fraction of the GPU's resident-block (or warp) slots that are filled with active work, not a raw block count. While high occupancy is not the ultimate goal, it gives the scheduler other resident blocks to switch to, performing math while another block waits on a high-latency memory fetch from VRAM.
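A toy model makes the latency-hiding arithmetic concrete. Assume (hypothetically) that each block pays a fixed memory latency and then does a fixed amount of math; the number of resident blocks needed to keep the math pipeline busy is just the ratio of the two:

```python
def blocks_needed_to_hide(mem_latency, compute_cycles):
    """Resident blocks required so math fully overlaps memory latency
    in this simplified model (ceiling division)."""
    return -(-mem_latency // compute_cycles)

def time_to_finish(num_blocks, mem_latency, compute_cycles):
    """Total cycles when latency is fully hidden: only the first
    block's memory wait is exposed; the rest overlap with math."""
    return mem_latency + num_blocks * compute_cycles

# Illustrative numbers: 400-cycle DRAM latency, 20 cycles of math per block.
print(blocks_needed_to_hide(400, 20))   # 20 resident blocks suffice
print(time_to_finish(100, 400, 20))     # 2400 cycles for 100 blocks
```

Real GPUs interleave at warp granularity with variable latencies, but the takeaway holds: the longer the memory latency relative to the math, the more work must be in flight.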
4. Hardware Utilization
To maximize performance, we must align our BLOCK_SIZE with the GPU architecture's memory coalescing rules, ensuring that consecutive threads access consecutive memory addresses.
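The cost of breaking coalescing can be counted directly. In this sketch (addresses in elements; the 32-thread warp and 128-byte transaction segment are common GPU figures, used here as assumptions), a unit-stride pattern hits one memory segment per warp, while a large stride scatters the same 32 loads across 32 segments:

```python
WARP = 32  # threads per warp (typical, assumed here)

def addresses(base, stride):
    """Element index touched by each thread of one warp."""
    return [base + t * stride for t in range(WARP)]

def segments_touched(addrs, bytes_per_elem=4, segment_bytes=128):
    """Distinct 128-byte memory segments a warp's float32 loads hit."""
    return len({(a * bytes_per_elem) // segment_bytes for a in addrs})

# Consecutive threads, consecutive elements: one 128-byte transaction.
print(segments_touched(addresses(0, stride=1)))   # 1 segment: coalesced

# Stride-32 access: every load lands in its own segment.
print(segments_touched(addresses(0, stride=32)))  # 32 segments: 32x traffic
```

Keeping BLOCK_SIZE a multiple of the warp width, with thread t reading element base + t, is what guarantees the coalesced case.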